Introduction to Open Data Science - Course Project

About the project

Let’s get started with this course. Right now, I’m feeling rather good: I submitted my first article earlier today. Here’s hoping it doesn’t bounce back immediately.

What I expect from this course:

I heard about this course via an email.

I thought the intro with Health Data Science worked quite nicely, though there is a lot of content to go through.

What I’m not yet sold on is R’s syntax. There seem to be many ways and libraries to do the same thing, and for someone coming from Python and pandas, none of them feels quite as intuitive. Of course, this is at least partly a matter of taste and habit.

Moving on:

# This is a so-called "R chunk" where you can write R code.
date <- format(Sys.Date(), format="%d %m %Y")

cat('hello world on', date)
## hello world on 05 12 2023

Find my repo here: https://github.com/tadusko/IODS-project



Chapter 2: Regression and model validation

This week we started our journey into regression analysis with simple and multiple linear models, and used graphical methods to validate them.

Data description

This week, we’ll be working with survey data to understand connections between students’ attitudes and motivations and their performance in an exam. The data is from 2014 and was collected by Kimmo Vehkalahti. A full metadata description is provided here.

Since the original data includes tens of variables, it has been preprocessed by collapsing the answers to many Likert-scaled claims into single variables by theme. For example, the variable “stra” is the mean of the answers related to “Study strategy”.
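As a sketch of what that collapsing looks like (the item names below are made up for illustration; the real data uses the original question codes):

```r
# Collapse several Likert-scaled items into one theme variable by
# taking their row mean. Item names q1–q3 are hypothetical.
answers <- data.frame(
  q1 = c(4, 2, 5),
  q2 = c(3, 3, 4),
  q3 = c(5, 1, 4)
)
answers$stra <- rowMeans(answers[, c("q1", "q2", "q3")])
answers$stra
## [1] 4.000000 2.000000 4.333333
```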

Let’s read in the data and explore its structure:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
df <- read_csv("data/learning2014.csv", show_col_types = FALSE)
str(df)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(df)
## [1] 166   7

Interpretation:

Cool! There are 166 observations (rows) and 7 variables (columns). All variables are numerical except gender, which is a binary character variable (F/M). The variable points records each student’s exam performance and is our target variable.

Data overview

Let’s start with a more thorough exploration of the variables. It’s important to know the distribution of the variables and their interconnections. As a reminder, the task is to predict exam performance from survey results that try to capture the study strategies and attitudes of students.

Let’s therefore create a graphical representation of the variables and a summary statistics table.

# calling the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# creating a plot matrix to explore the distribution of the variables and the relationships between them.
p <- ggpairs(df, mapping = aes(), title="Relationships between the variables", progress = FALSE, lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

# summary stats of the numerical columns
summary(dplyr::select(df, -c(gender)))
##       age           attitude          deep            stra      
##  Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##  Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##  Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##  3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##  Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
# counts per gender
table(df$gender)
## 
##   F   M 
## 110  56

Interpretation:

Starting with the background variables (age & gender), the participants skew young (median age 22, mean 25.5). There are about twice as many women as men in the data.

Looking at the histograms, many of the survey variables (attitude, deep, stra, surf) are approximately normally distributed. The variables are mostly uncorrelated with each other.

The rightmost column shows correlations between exam points and the background/survey variables. Only one of them stands out as meaningful: attitude towards statistics is positively correlated with exam performance (Pearson correlation coefficient: 0.437). The strategic and surface learning variables yield the next-highest correlations (0.146 and -0.144, respectively).

Creating a multiple regression model

Based on the correlation coefficients, let’s build a regression model with three explanatory variables: attitude towards statistics (ATTITUDE), strategic learning questions (STRA) and surface learning questions (SURF).

# create a regression model with multiple explanatory variables
multi_model <- lm(points ~ attitude + stra + surf, data = df)

# summarize model
summary(multi_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Interpretation and explanation

This model does not work particularly well: multiple R-squared is only 0.2074 and adjusted R-squared 0.1927. R-squared quantifies how much of the variation in exam points is captured by the explanatory variables.

Similarly to the previous correlations, only ATTITUDE is statistically significant. Therefore, let’s fit another model without the non-significant variables.

A simpler regression model

# create a regression model with the significant variable
model <- lm(points ~ attitude, data = df)

# summarize model
summary(model)
## 
## Call:
## lm(formula = points ~ attitude, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Interpretation and explanation

While multiple R-squared is slightly lower, this is not a cause for concern. After all, even spurious variables included in a model will explain some of its noise.

Multiple R-squared for a model with a single explanatory variable is simply the square of the correlation coefficient (0.437² ≈ 0.191).
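This identity is easy to verify; the quick check below uses simulated data rather than the course data:

```r
# For a simple linear regression with an intercept, multiple R-squared
# equals the squared Pearson correlation of predictor and response.
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)
all.equal(summary(fit)$r.squared, cor(x, y)^2)
## [1] TRUE
```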

In this model, the explanatory variable is significant. Let’s continue using it.

Model diagnostics

Finally, we ought to make sure the model does not violate the assumptions of normality and constant variance.

For this purpose, let’s plot the residuals against the fitted values, residuals in a Quantile-Quantile plot and Residuals vs Leverage.

par(mfrow = c(2,2))

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
plot(model, which=c(1,2,5))

Interpretation

Starting from Residuals vs Fitted, we are looking for a roughly even scatter of residuals across the whole range of fitted values. While a few outliers exist, it looks good overall. The constant variance assumption is not violated.

The Q-Q plot should form an approximately straight line. Again, that is achieved: the residuals are roughly normally distributed.

Leverage tells us how much influence individual observations have on the model fit. I don’t see anything too alarming.

Summary: the model does not violate the underlying assumptions. However, it does not explain the phenomenon (exam performance) particularly well, either.


Chapter 3: Logistic regression

This week we dive into logistic regression. It’s an approach that can be applied when the response variable is binary. Since the previously used linear regression models carry assumptions, such as normality of the errors, we cannot use them here. Instead, we’ll rely on Generalized Linear Models (GLMs). The output of the model is the probability that the response variable is true.

What are GLMs? Let’s work through an exercise to find out. The task is to examine high alcohol consumption and its interplay with various socioeconomic variables. A successful model will be able to predict high alcohol consumption from other variables tied to the individual.
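Before diving in, it helps to see the link function that makes this work: a binomial GLM passes a linear predictor through the logistic (inverse logit) function to get a probability. A minimal illustration:

```r
# The binomial GLM maps the linear predictor eta to a probability via
# the logistic function: p = 1 / (1 + exp(-eta)).
eta <- c(-2, 0, 2)
p <- 1 / (1 + exp(-eta))
round(p, 3)
## [1] 0.119 0.500 0.881
# plogis() is R's built-in equivalent
all.equal(p, plogis(eta))
## [1] TRUE
```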

Data description

We use a dataset of Portuguese secondary school students collected by combining school records and questionnaires. The raw data was created by Paulo Cortez and is available at this link under CC BY 4.0. Other metadata can also be found at the link.

I have preprocessed the data by joining the two distinct datasets in the original file (math and Portuguese performance). In addition, I have created a new column (alc_use) that represents the mean alcohol use of students on a five-point scale from very low to very high. From that, I’ve derived a binary variable (high_use), which is TRUE if the mean alcohol use is greater than 2.
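As a sketch of that derivation step (assuming, per the dataset’s metadata, that workday and weekend consumption are stored in the Dalc and Walc columns, both on a 1–5 scale):

```r
# Derive mean alcohol use and the binary high-use flag on toy data.
toy <- data.frame(Dalc = c(1, 3, 2), Walc = c(1, 5, 2))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2
toy$high_use <- toy$alc_use > 2
toy$alc_use   # 1 4 2
toy$high_use  # FALSE TRUE FALSE
```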

library(dplyr)
library(readr)
alc <- read_csv("data/alc.csv", show_col_types = FALSE)
str(alc)
## spc_tbl_ [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ school    : chr [1:370] "GP" "GP" "GP" "GP" ...
##  $ sex       : chr [1:370] "F" "F" "F" "F" ...
##  $ age       : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr [1:370] "U" "U" "U" "U" ...
##  $ famsize   : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr [1:370] "A" "T" "T" "T" ...
##  $ Medu      : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr [1:370] "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr [1:370] "teacher" "other" "other" "services" ...
##  $ reason    : chr [1:370] "course" "course" "other" "home" ...
##  $ guardian  : chr [1:370] "mother" "father" "mother" "mother" ...
##  $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
##  $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
##  $ famsup    : chr [1:370] "no" "yes" "no" "yes" ...
##  $ activities: chr [1:370] "no" "no" "no" "yes" ...
##  $ nursery   : chr [1:370] "yes" "no" "yes" "yes" ...
##  $ higher    : chr [1:370] "yes" "yes" "yes" "yes" ...
##  $ internet  : chr [1:370] "no" "yes" "yes" "yes" ...
##  $ romantic  : chr [1:370] "no" "no" "no" "yes" ...
##  $ famrel    : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
##  $ failures  : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
##  $ paid      : chr [1:370] "no" "no" "yes" "yes" ...
##  $ absences  : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   school = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   address = col_character(),
##   ..   famsize = col_character(),
##   ..   Pstatus = col_character(),
##   ..   Medu = col_double(),
##   ..   Fedu = col_double(),
##   ..   Mjob = col_character(),
##   ..   Fjob = col_character(),
##   ..   reason = col_character(),
##   ..   guardian = col_character(),
##   ..   traveltime = col_double(),
##   ..   studytime = col_double(),
##   ..   schoolsup = col_character(),
##   ..   famsup = col_character(),
##   ..   activities = col_character(),
##   ..   nursery = col_character(),
##   ..   higher = col_character(),
##   ..   internet = col_character(),
##   ..   romantic = col_character(),
##   ..   famrel = col_double(),
##   ..   freetime = col_double(),
##   ..   goout = col_double(),
##   ..   Dalc = col_double(),
##   ..   Walc = col_double(),
##   ..   health = col_double(),
##   ..   failures = col_double(),
##   ..   paid = col_character(),
##   ..   absences = col_double(),
##   ..   G1 = col_double(),
##   ..   G2 = col_double(),
##   ..   G3 = col_double(),
##   ..   alc_use = col_double(),
##   ..   high_use = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(alc)
## [1] 370  35

There are 370 observations and 35 variables – an extensive set of socioeconomic background variables! Check out the dataset’s metadata for more info on the variables.

Hypothesising

Let’s pick four variables of interest and examine their connection to alcohol consumption. Below are my picks and my hypotheses on their connection to alcohol:

  • Sex [binary]: I assume men will drink more on average.
  • Traveltime [numeric, less–more 1–5]: Could be a proxy for many things, e.g. family wealth or commitment to attending school despite a long commute. I’ll wager higher travel time is linked to higher consumption.
  • Studytime [numeric, 1–5]: More time to study, less time to party?
  • goout [numeric, 1–5]: More outgoing folks will drink more.

Exploration of variables

Let’s look into the variables and their relationship with high alcohol consumption in detail to see if the hypotheses hold any water!

First, some simple bar plots. The first set shows the distribution of each variable. Then, each variable’s plot is split in two, showing its distribution when high_use is TRUE and when it is FALSE.

library(ggplot2)
library(tidyr)
library(purrr)

# extracting only the variables of interest to a new dataframe
extr <- alc[,c("sex", "traveltime", "studytime", "goout")]

# histograms
gather(extr) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

# printing out multiple plots was way harder than it should be
# anyway, this should do
walk(names(extr), ~{
  g <- ggplot(data = alc, aes(x = !!sym(.)))
  g <- g + geom_bar() + facet_wrap(~high_use)
  print(g)
})

Interpretation

The plots show at least that men are more prevalent among the high users. In addition, the distribution of goout differs noticeably between high users and the rest. However, for the other variables, bar plots may not be the best approach.

Let’s therefore tabulate and group by the binary variables (high_use, sex) and calculate the mean values for each numerical subgroup.

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_travel = mean(traveltime), mean_study = mean(studytime), mean_out = mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 6
## # Groups:   sex [2]
##   sex   high_use count mean_travel mean_study mean_out
##   <chr> <lgl>    <int>       <dbl>      <dbl>    <dbl>
## 1 F     FALSE      154        1.38       2.34     2.95
## 2 F     TRUE        41        1.46       2        3.39
## 3 M     FALSE      105        1.38       1.90     2.70
## 4 M     TRUE        70        1.67       1.63     3.93

Interpretation

Interesting! We see evidence consistent with all of the hypotheses – male sex, long travel time, low study time and an outgoing personality are all associated with high use. However, the differences in mean travel and study times are quite small. A statistical test would be needed to say anything with more confidence.

Logistic regression

Now, to make a GLM with the variables we have. As a reminder, the model tries to predict a binary value: whether alcohol use is high or not.

# creating the model
m <- glm(high_use ~ goout + traveltime + studytime + sex, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ goout + traveltime + studytime + sex, 
##     family = "binomial", data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.2372     0.6459  -5.012 5.39e-07 ***
## goout         0.7455     0.1200   6.212 5.24e-10 ***
## traveltime    0.3422     0.1795   1.907  0.05657 .  
## studytime    -0.4703     0.1704  -2.760  0.00579 ** 
## sexM          0.7021     0.2639   2.660  0.00780 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 376.10  on 365  degrees of freedom
## AIC: 386.1
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)       goout  traveltime   studytime        sexM 
##  -3.2371999   0.7454873   0.3422454  -0.4702584   0.7021031

Interpretation

Other than travel time, all the variables are significant (p < 0.05). They affect the model in different ways: for example, an increase in study time decreases the predicted probability of high use, whereas the effect of increasing outgoingness is the opposite.

Next, let’s interpret the model coefficients as odds ratios and calculate confidence intervals for them.

# find the model with glm()
m <- glm(high_use ~ goout + traveltime + studytime + sex, data = alc, family = "binomial")

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.03927371 0.01061421 0.1346865
## goout       2.10746808 1.67636472 2.6864367
## traveltime  1.40810576 0.99222702 2.0108276
## studytime   0.62484081 0.44291107 0.8658021
## sexM        2.01799228 1.20598329 3.4012264

Interpretation

Again, we see that travel time is not significant: its confidence interval includes 1, so we cannot conclude that travel time makes a difference to the odds of high use.

Beyond that, consider for example the effect of male sex: the odds of being a high alcohol user are roughly twice as high (OR ≈ 2.02, CI: 1.21–3.40) for males.
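A small sketch of how to read these numbers: exponentiating a logistic coefficient gives an odds ratio, and odds ratios scale odds rather than probabilities. The baseline odds value below is made up for illustration:

```r
# Exponentiating a log-odds coefficient gives an odds ratio.
or_male <- exp(0.7021)   # fitted sexM coefficient from above
round(or_male, 2)        # ~2.02
# Odds ratios multiply odds, not probabilities: with illustrative
# baseline odds of 0.3 (p ~ 0.23), doubling the odds gives p ~ 0.38.
odds <- 0.3 * or_male
odds / (1 + odds)        # ~0.377
```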

Exploring the predictive power of the model

Finally, let’s see if the model does what we claimed it does. For that, let’s use it to predict the outcome for each student and then make a confusion matrix.

# let's drop the insignificant variable
m <- glm(high_use ~ goout + studytime + sex, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
print(table(high_use = alc$high_use, prediction = alc$prediction))
##         prediction
## high_use FALSE TRUE
##    FALSE   234   25
##    TRUE     58   53
# tabulate the target variable versus the predictions; probabilities
print(table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins())
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.63243243 0.06756757 0.70000000
##    TRUE  0.15675676 0.14324324 0.30000000
##    Sum   0.78918919 0.21081081 1.00000000

Interpretation

Hmm, not bad, but the model seems to make awfully many false negative predictions (n = 58, 16%).
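The rates quoted above can be computed directly from the confusion matrix; a quick sketch using the counts printed earlier:

```r
# Rebuild the confusion matrix from the printed counts.
cm <- matrix(c(234, 58, 25, 53), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
accuracy    <- sum(diag(cm)) / sum(cm)                   # share correct overall
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])    # high users caught
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ]) # non-high users caught
round(c(accuracy, sensitivity, specificity), 3)
## [1] 0.776 0.477 0.903
```

The low sensitivity (~0.48) is exactly the false negative problem noted above.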

Better than a guess?

The model should at least do better than naïve guessing. Let’s test this by computing a loss function for the predictions and comparing it to the number of errors we would make by labelling every single case FALSE or TRUE.

# define a loss function (mean prediction error)
# a version with an adjustable threshold
loss_func <- function(class, prob, threshold = 0.5, equals = FALSE) {
  if (equals) {
    n_wrong <- abs(class - prob) >= threshold
  } else {
    n_wrong <- abs(class - prob) > threshold
  }
  mean(n_wrong)
}

# the simpler version actually used below
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2243243
loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7

Interpretation

Assuming everyone is a non-high user gives an error rate of 0.3, compared to the model’s error rate of ~0.22. Therefore, the model at least outperforms a naïve baseline.

K-fold cross validation

K-fold cross-validation is a validation method in which the input data is split into K parts, each serving in turn as the test set while the remaining parts are used for training. The aim is to describe model performance in a way that is less susceptible to overfitting and randomness than training and testing on the same data.
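As a sketch of the splitting idea (separate from cv.glm(), which handles this internally):

```r
# Each observation gets a fold label 1..K; fold k is held out in round k.
set.seed(42)
n <- 370; K <- 10
folds <- sample(rep(1:K, length.out = n))
table(folds)  # each of the 10 folds holds exactly 37 observations
# for each k: fit on folds != k, predict on folds == k, average the losses
```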

library(boot)

# doing the cross validation – K = number of rows = 370, i.e. leave-one-out
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2648649

Interpretation

The average error rate of ~0.26 is a bit worse than the error rate of ~0.22 computed on the full training data. It is no better than the error rate of the model explored in this week’s exercises.

Chapter 3 summary

This week, we explored logistic regression, odds ratios and (cross-)validation. Working through an example dataset showed how a binary variable (high alcohol consumption or not) can be predicted with a GLM using other variables. The performance of the model could then be quantified and validated. The model outdid a naïve baseline but could probably be fine-tuned with better-fitting variables. Three of my four earlier hypotheses held quite nicely! Only travel time, which could indicate a great many things, was not useful for prediction.


Chapter 4: Clustering and classification

It’s common to examine data through premade socio-economic and societal classes fitted onto the data – male and female, young and old, and so on. Should we be bound to these, or could we let suitable classes emerge from the features of the data?

This week, we’ll delve into methods for creating clusters of observations that are alike in some (meaningful) way. Once the clusters are created, they may be labelled as classes, and new observations may be classified into them.

Tasks 2–3: Data exploration

Description of the data

The dataset we’ll be using is provided through the R package ‘MASS’. The ‘Boston’ dataset consists of various descriptive values for Boston, Massachusetts, such as per-capita crime rate, nitrogen oxide concentrations and apartment values. Full metadata can be found here.

Let’s peek at the data.

suppressMessages(library(MASS))
suppressMessages(library(dplyr))

# load the data
bost <- Boston

# explore the dataset
str(bost)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(bost)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# plot matrix of the variables
#pairs(bost)

The data frame has 506 rows and 14 columns. The variable types differ quite a bit: some are counts, others population-normalized values, and one (‘chas’) is a binary dummy variable.

Distributions and correlations

The multitude of variables makes simple interpretations rather tough. Let’s check out a mix of histograms and scatter plots anyway.

# needed for histograms
# from pairs plot documentation: https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/pairs.html
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr = usr))
    par(usr = c(usr[1:2], 0, 1.5))
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
# plot matrix of the variables
pairs(Boston, cex = 0.05, diag.panel = panel.hist, upper.panel = NULL)

Interpretation

For example, the crime rate and residential land share variables are heavily right-skewed (most values are low, with a long tail of high values). Others are closer to normal (‘lstat’) or have other distributions (‘rad’).

Some of them seem to correlate, but for a proper overview, let’s print a correlation matrix and a correlation plot.

cor_matrix <- cor(Boston) %>% round(digits=2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00

# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b",tl.pos = "d",tl.cex = 0.6)

Interpretation

Crime rate correlates positively with, e.g., the property tax rate (‘tax’) and negatively with, e.g., the proportion of Black population (‘black’). In general, many of the variables are correlated with each other, which is not surprising given the interlinkages of, for example, property values and crime.

Task 4: Standardization and other data wrangling

As mentioned before, the variables are not directly comparable right now. Let’s use the scaling functionality in R to make them so.

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Alright! All the variables are now centered on 0 and expressed in standard deviations; most values fall roughly between -4 and 4, with the heavily skewed crime rate reaching almost 10.
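For reference, what scale() does to each column can be written out by hand – a quick Python sketch with made-up numbers:

```python
import numpy as np

# z-score standardization, as done by R's scale() per column:
# subtract the column mean, divide by the (sample) standard deviation.
x = np.array([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std(ddof=1)
# z now has mean 0 and sample standard deviation 1
```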

Quantiles and a categorical variable

We will try to examine crime through the lens of four classes (low, medium low, medium high and high). For that, we’ll need to create a factor variable from a continuous one – in this case, by splitting the crime variable at its quartiles, which yields four bins with roughly equal numbers of observations.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
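The same quartile-binning idea, sketched in Python with synthetic data of the same length as the Boston set:

```python
import numpy as np

# Quartile binning, analogous to cut(x, breaks = quantile(x), ...):
# break points at the quartiles give four bins of nearly equal size.
rng = np.random.default_rng(42)  # seeded toy data
x = rng.normal(size=506)         # same length as the Boston data

inner_breaks = np.quantile(x, [0.25, 0.5, 0.75])
labels = np.digitize(x, inner_breaks)  # 0..3 ~ low..high
counts = np.bincount(labels)
# counts: four classes of ~126-127 observations each
```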

Train and test sets

A random 80 % of the data is used for fitting a linear discriminant model; the rest is set aside for testing. (No seed is set here, so the split changes between runs.)

# number of rows in the Boston dataset 
n <- nrow(Boston)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
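The split logic itself is simple – sampling row indices without replacement. A Python sketch of the same 80/20 split:

```python
import numpy as np

# Draw 80 % of row indices without replacement; the rest form the test set.
rng = np.random.default_rng(0)
n = 506                                    # rows in the Boston data
train_idx = rng.choice(n, size=int(n * 0.8), replace=False)
test_mask = np.ones(n, dtype=bool)
test_mask[train_idx] = False
# -> 404 training rows, 102 test rows
```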

Task 5. Linear discriminant analysis

The point of linear discriminant analysis is to classify an observation into a class based on continuous variables. In this case, we’ll use all the other variables to find combinations of variables that discern the crime rate classes.
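Before fitting, here is the core idea for the simplest two-class case, sketched by hand in Python/numpy. This is not what MASS does internally – just the textbook rule w = S^-1 (mu1 - mu0) applied to synthetic data:

```python
import numpy as np

# Two-class LDA in a nutshell: the discriminant direction is
# w = S^-1 (mu1 - mu0), where S is the pooled within-class covariance.
# Synthetic, well-separated classes for illustration:
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, size=(50, 2))
X1 = rng.normal(3.0, 1.0, size=(50, 2))

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S = (np.cov(X0.T, ddof=1) + np.cov(X1.T, ddof=1)) / 2  # equal class sizes
w = np.linalg.solve(S, mu1 - mu0)

# Classify by projecting onto w and splitting at the midpoint of class means
midpoint = ((mu0 + mu1) / 2) @ w
acc = (np.sum(X0 @ w < midpoint) + np.sum(X1 @ w > midpoint)) / 100
# with this much separation, accuracy is near 1
```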

# linear discriminant analysis (lda() comes from the MASS package)
library(MASS)
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2376238 0.2475248 0.2425743 0.2722772 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       1.01211968 -0.9159374 -0.108283225 -0.8845468  0.44156484 -0.8761050
## med_low  -0.05225164 -0.3185811  0.003267949 -0.5735881 -0.10495672 -0.3913445
## med_high -0.39798574  0.2044663  0.169590347  0.3620007  0.06713718  0.3831161
## high     -0.48724019  1.0169558 -0.021786325  1.0371080 -0.40053861  0.8119692
##                 dis        rad        tax     ptratio      black       lstat
## low       0.9044029 -0.6899692 -0.7623671 -0.42933733  0.3804891 -0.76470026
## med_low   0.3970287 -0.5454537 -0.4467096 -0.08154182  0.3140611 -0.16223964
## med_high -0.3507105 -0.4217006 -0.3130323 -0.25330464  0.1226699  0.03120756
## high     -0.8555861  1.6397657  1.5152267  0.78268316 -0.8710819  0.90600618
##                  medv
## low       0.504858410
## med_low   0.006327391
## med_high  0.171399316
## high     -0.712485828
## 
## Coefficients of linear discriminants:
##                 LD1         LD2        LD3
## zn       0.13078096  0.74993925 -1.0086449
## indus    0.08101934 -0.33295805  0.1070789
## chas    -0.06615964 -0.01676615  0.1930100
## nox      0.34641925 -0.75010520 -1.2913756
## rm      -0.10248085 -0.07249731 -0.1178015
## age      0.20384032 -0.24464865 -0.2179340
## dis     -0.09810883 -0.29516819  0.2263243
## rad      3.47905947  0.90916759 -0.3527900
## tax     -0.06513463  0.07741802  1.0168652
## ptratio  0.12570572 -0.05891239 -0.3263691
## black   -0.11816944  0.03073197  0.1019807
## lstat    0.26756961 -0.33577566  0.2713138
## medv     0.21001841 -0.54118729 -0.3111143
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9559 0.0334 0.0107
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

Interpretation

The biplot shows two clear clusters once the data is collapsed onto the first two discriminants. The one on the right seems to contain mostly high-crime observations – but a lot of noise, too.

Task 6. Classification with LDA

Let’s formally test the model.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18      10        3    0
##   med_low    2      18        6    0
##   med_high   0       7       19    2
##   high       0       0        0   17

Interpretation

The majority of predictions are correct for each class. The extreme classes (low and high) are seemingly easier to classify than the middle-ground ones.
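The table is easy to summarize numerically; a quick Python check using the counts printed above:

```python
import numpy as np

# The cross-tabulation above: rows = correct class, cols = predicted class
cm = np.array([[18, 10,  3,  0],
               [ 2, 18,  6,  0],
               [ 0,  7, 19,  2],
               [ 0,  0,  0, 17]])

accuracy = np.trace(cm) / cm.sum()     # 72 / 102, about 0.71
recall = np.diag(cm) / cm.sum(axis=1)  # per-class share of correct
# recall: low ~0.58, med_low ~0.69, med_high ~0.68, high 1.00
```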

Task 7. K-means

Standardization and distance measures

After again scaling the raw values, distance matrices are built – below, two distance metrics (Euclidean and Manhattan) are used for the task. They give mildly differing distance values, but let’s use the default Euclidean measure.

boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method='manhattan')

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
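The difference between the two metrics on a single pair of points, sketched in Python (Manhattan is always at least as large as Euclidean, consistent with the summaries above):

```python
import numpy as np

# Euclidean is the straight-line distance; Manhattan sums the absolute
# coordinate differences. Classic 3-4-5 example:
a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
euclidean = np.sqrt(((a - b) ** 2).sum())  # 5.0
manhattan = np.abs(a - b).sum()            # 7.0
```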

Clustering

On to clustering. The k-means algorithm wants to know the number of clusters to create. Let’s start off, somewhat arbitrarily, with 4.

library(ggplot2)
set.seed(42)

# k-means clustering
km <- kmeans(boston_scaled, centers = 4)

# plot the Boston dataset with clusters
# split into two figures for readability
pairs(boston_scaled[1:7],cex=0.5, col = km$cluster)

pairs(boston_scaled[7:14],cex=0.5, col = km$cluster)

To make the analysis more robust, we’ll test the inner cohesion of the clusters with differing numbers of centers. Seemingly, 2 is optimal, since the most radical drop in the total within-cluster sum of squares (WCSS) happens at that point.
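For intuition about this elbow heuristic, here is a bare-bones Lloyd’s k-means in Python on synthetic two-blob data (a sketch, not the kmeans() implementation): the WCSS curve collapses at the true number of clusters and flattens after.

```python
import numpy as np

def total_wcss(X, k, iters=25, seed=0):
    # Plain Lloyd's k-means; returns the total within-cluster sum of squares
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))

# Two clearly separated blobs: the WCSS drops sharply from k = 1 to k = 2,
# then flattens - the "elbow" used to pick the number of clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
curve = [total_wcss(X, k) for k in (1, 2, 3)]
```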

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled,cex=0.5, col = km$cluster, upper.panel=NULL)

# split into two for readability
pairs(boston_scaled[1:7],cex=0.5, col = km$cluster, upper.panel=NULL)

pairs(boston_scaled[7:14],cex=0.5, col = km$cluster, upper.panel=NULL)

Interpretation

With the scaled data, many of the variables seem to form horizontal and vertical “lines” at certain values. I wonder how much that affects clustering. Anyway, visual examination shows that the two clusters seem mostly sensible across the variables.


Chapter 5: Dimensionality reduction techniques

Data scarcity has a twin brother, data deluge. Sometimes we have multiple datasets to choose from, or many variables. These variables may in addition be linked to each other – for example, educational level attained and income are correlated.

This week, we’ll examine exploratory methods created for distilling the relevant information of many variables into a few. These dimensionality reduction techniques, such as principal component analysis, return fewer columns that try to capture the variation in the original variables.

1. Data description and correlations

This week’s data has been fused from two datasets with socioeconomic variables describing countries: the human development index (HDI) and the gender inequality index (GII). The data is offered by the United Nations Development Programme (UNDP). See the full metadata description here and refer to the data wrangling script for further information on how the data has been modified and what the column names refer to.

Let’s read in the data and set country names as index.

library(tibble)
library(GGally)
library(readr)
library(dplyr)

human <- read_csv("data/human.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
human <- column_to_rownames(human, "Country")

Good. Then some data exploration.

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(
 human, progress=F,
 upper = list(continuous = wrap("density", alpha = 0.5), combo = "box"),
 lower = list(continuous = wrap("points", alpha = 0.3,    size=0.1), 
              combo = wrap("dot", alpha = 0.4,            size=0.2))
)

Educational variables and the share of female parliamentary representation are roughly normally distributed, whereas the rest are skewed left (life expectancy) or right (e.g., gross national income (GNI), adolescent birth rate). Many of the variables seem to be correlated based on the scatter plots, but let’s confirm this hunch with a correlation plot.

# Access corrplot
library(corrplot)

# mark insignificant correlations
testRes = cor.mtest(human, conf.level = 0.95)

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot(p.mat = testRes$p, sig.level = 0.05, addCoef.col ='black', number.cex = 0.75, insig='blank', type = 'lower')

The educational variables are highly correlated with each other and with life expectancy and GNI (positively) and with maternal mortality and adolescent birth rate (negatively). Expectedly, lower maternal mortality and fewer adolescent births are associated with wealthier countries (higher GNI). Female labor participation and parliamentary representation are only weakly, if at all, correlated with the other variables.

2. PCA on non-standardized data

Let’s get on with the PCA. Below is an example of how not to do it. With unstandardized data, PCA is dominated by the variables with the largest variances. In this data, that is GNI, which is measured in dollars per capita – much larger values than the percentage shares or years of the other variables.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2,cex = c(0.5, 1.2), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


# create and print out a summary of pca_human
s <- summary(pca_human)
print(s)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# rounded percentages of variance captured by each PC
pca_pr <- round(1*s$importance[2, ]*100, digits = 1)

# print out the percentages of variance
print(pca_pr)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0

As a consequence of the dominance of GNI, the first principal component mostly describes the variation of GNI, not of the other variables. Standardization is needed.
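The effect is easy to reproduce; a Python sketch with one synthetic dollars-scale column next to two small-scale ones (PCA done via SVD on a hypothetical toy setup):

```python
import numpy as np

def pc1_share(X):
    # Share of total variance captured by PC1 (PCA via SVD of centered data)
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    return s[0] ** 2 / (s ** 2).sum()

# Synthetic data with one huge-scale column (GNI-like) and two small ones
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 10_000, 200),   # dollars-scale
                     rng.normal(0, 1, 200),        # ratio-scale
                     rng.normal(0, 1, 200)])

raw = pc1_share(X)                          # ~1.0: PC1 is just the big column
std = pc1_share(X / X.std(axis=0, ddof=1))  # shares even out after scaling
```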

3. PCA on standardized data

The values are standardized: first, the mean of each variable is subtracted from its values, then each value is divided by that variable’s standard deviation.

# standardize the variables
human_std <- scale(human)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# create and print out a summary of pca_human
s <- summary(pca_human)
print(s)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# rounded percentages of variance captured by each PC
pca_pr <- round(1*s$importance[2, ]*100, digits = 1)

# print out the percentages of variance
print(pca_pr)
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2,cex = c(0.5, 1.0), col = c("grey40", "deeppink2"))

Standardization makes all the difference, for the reasons described above. Now we see that positively correlated variables lie close to each other along the first component (education, income, life expectancy), while negatively correlated ones (maternal mortality, adolescent birth rate) point in the opposite direction. The two variables that were less related to the others load on a component of their own.

4. Interpreting the components

Although PC interpretation is often not useful – it’s educated guesswork, in essence – let’s try to see how we could interpret the first two principal components and their scatterplot.

Because the first PC seems to be broadly related to people’s health and income levels (positive or negative), I’ll name it “Health and wellbeing”. The second PC gets the label “Female societal participation” after the variables describing parliamentary share and labour participation.

descriptions <- c("PC1: Health and wellbeing", "PC2: Female societal participation")
pc_lab <- paste0(descriptions, " (", pca_pr[1:2], "%)")

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2,cex = c(0.5, 1.0), col = c("grey40", "deeppink2"),  xlab = pc_lab[1], ylab = pc_lab[2])

Interpreted through these labels, high-wellbeing nations (such as the Nordic countries) cluster to the left and top (good health, high female societal participation). On the other hand, countries with low female societal status, such as Syria and Iran, are placed near the bottom of the plot. Developing countries with weaker healthcare systems and societal safety nets – and thus high maternal mortality and many adolescent births – are placed towards the right of the plot.

5. Tea & Multiple correspondence analysis

Now for something completely different. We’ll examine multiple correspondence analysis (MCA), a method related to PCA but applicable to categorical data.

The example data originates from the FactoMineR package – it contains 300 respondents’ answers to 36 questions related to tea: 18 about tea drinking habits, 12 about the perception of the product and 6 background questions.
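Conceptually, MCA operates on the indicator (one-hot) matrix built from the categorical answers – each level becomes a 0/1 column. A minimal Python sketch with a hypothetical four-respondent survey:

```python
import numpy as np

# MCA works on the indicator matrix of the answers: one 0/1 column per level.
# Tiny hypothetical survey: 4 respondents, 2 questions (breakfast, sex)
answers = [("breakfast", "F"), ("Not.breakfast", "M"),
           ("breakfast", "M"), ("breakfast", "F")]

levels = sorted({a for row in answers for a in row})
Z = np.array([[int(lvl in row) for lvl in levels] for row in answers])
# each respondent row has exactly one 1 per question
```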

library(FactoMineR)

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

dim(tea)
## [1] 300  36
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   +60  :38   +2/day     :127  
##  middle      :40   sportsman    :179   15-24:92   1 to 2/week: 44  
##  non-worker  :64                       25-34:69   1/day      : 95  
##  other worker:20                       35-44:40   3 to 6/week: 34  
##  senior      :35                       45-59:61                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
# The code below prints bar plots for each variable
# uncomment to run

#factor_variable_names <- names(Filter(is.factor, tea))

#for (variable in factor_variable_names) {
#  barplot(table(tea[[variable]]), main = variable, col = "skyblue", cex.main = 0.8)
#}

Let’s filter the data to keep only those columns that relate to tea drinking times and one for sex.

# keep only five columns: sex and those related to tea consumption times.
tea_when  <-  tea[c("breakfast","tea.time","lunch","dinner", "sex")]

# multiple correspondence analysis
mca <- MCA(tea_when, graph = F)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_when, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5
## Variance               0.303   0.214   0.185   0.161   0.136
## % of var.             30.276  21.425  18.542  16.147  13.610
## Cumulative % of var.  30.276  51.701  70.244  86.390 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             |  0.414  0.189  0.210 |  0.736  0.842  0.663 | -0.268  0.129
## 2             | -0.019  0.000  0.001 |  0.212  0.070  0.068 | -0.346  0.215
## 3             |  0.598  0.394  0.113 | -1.029  1.649  0.334 |  0.250  0.112
## 4             |  1.548  2.639  0.700 | -0.298  0.138  0.026 |  0.412  0.306
## 5             |  0.414  0.189  0.210 |  0.736  0.842  0.663 | -0.268  0.129
## 6             |  1.548  2.639  0.700 | -0.298  0.138  0.026 |  0.412  0.306
## 7             | -0.103  0.012  0.015 |  0.528  0.433  0.390 | -0.353  0.224
## 8             | -0.276  0.084  0.145 | -0.626  0.610  0.745 | -0.102  0.019
## 9             | -0.103  0.012  0.015 |  0.528  0.433  0.390 | -0.353  0.224
## 10            |  0.414  0.189  0.210 |  0.736  0.842  0.663 | -0.268  0.129
##                 cos2  
## 1              0.088 |
## 2              0.181 |
## 3              0.020 |
## 4              0.050 |
## 5              0.088 |
## 6              0.050 |
## 7              0.175 |
## 8              0.020 |
## 9              0.175 |
## 10             0.088 |
## 
## Categories
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## breakfast     |  -0.371   4.367   0.127  -6.166 |   0.758  25.768   0.531
## Not.breakfast |   0.343   4.032   0.127   6.166 |  -0.700  23.786   0.531
## Not.tea time  |   0.802  18.541   0.498  12.205 |   0.271   3.003   0.057
## tea time      |  -0.621  14.372   0.498 -12.205 |  -0.210   2.328   0.057
## lunch         |  -0.994   9.568   0.170  -7.124 |   0.648   5.753   0.072
## Not.lunch     |   0.171   1.645   0.170   7.124 |  -0.111   0.989   0.072
## dinner        |   2.238  23.156   0.377  10.616 |  -0.868   4.922   0.057
## Not.dinner    |  -0.168   1.743   0.377 -10.616 |   0.065   0.370   0.057
## F             |  -0.484   9.181   0.342 -10.109 |  -0.493  13.453   0.354
## M             |   0.706  13.396   0.342  10.109 |   0.719  19.629   0.354
##                v.test     Dim.3     ctr    cos2  v.test  
## breakfast      12.599 |  -0.368   6.996   0.125  -6.107 |
## Not.breakfast -12.599 |   0.339   6.458   0.125   6.107 |
## Not.tea time    4.132 |   0.103   0.497   0.008   1.564 |
## tea time       -4.132 |  -0.080   0.385   0.008  -1.564 |
## lunch           4.647 |   2.089  69.026   0.750  14.975 |
## Not.lunch      -4.647 |  -0.359  11.864   0.750 -14.975 |
## dinner         -4.117 |   0.705   3.758   0.037   3.347 |
## Not.dinner      4.117 |  -0.053   0.283   0.037  -3.347 |
## F             -10.294 |  -0.068   0.298   0.007  -1.425 |
## M              10.294 |   0.100   0.435   0.007   1.425 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.127 0.531 0.125 |
## tea.time      | 0.498 0.057 0.008 |
## lunch         | 0.170 0.072 0.750 |
## dinner        | 0.377 0.057 0.037 |
## sex           | 0.342 0.354 0.007 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

What does this plot tell us? I’m not quite sure how to interpret it, to be honest. At least we saw previously that few people drink tea with dinner – those who do seem to be the outliers in this plot, too.